Predictive Modeling of Weather Station Data:

Linear Regression vs. Graph Neural Networks

Authors

Colby Fenters & Lilith Holland (Advisor: Dr. Cohen)

Published

July 28, 2025

Slides: slides.html


Introduction

This section will be expanded as the modeling process is further refined.

Accurate weather prediction is a crucial task with widespread implications across agriculture, transportation, disaster preparedness, and energy management. Traditional forecasting methods often rely on statistical models or physics-based simulations; however, with the advancement of graph neural networks (GNNs), we believe there is potential in a more modern deep learning approach.

In this project, we explore the predictive power of a traditional linear regression model and a GNN on real-world weather station data. Our aim is to evaluate whether the GNN's ability to incorporate spatial relationships between stations offers a measurable advantage over more conventional techniques.

The dataset consists of multiple weather stations located within the same geographic region. Each station collects meteorological variables over time and can be represented as a node within a broader spatial network. For the linear model baseline, a single model will be trained using all stations' data simultaneously, treating each station as an independent feature source.

For the GNN, the model will be trained on the entire network of stations, where each node corresponds to a station and edges represent spatial relationships. The graph is encoded via a dense adjacency matrix, excluding self-connections. The GNN aims to leverage the inherent spatial structure of the data, potentially capturing regional weather patterns and inter-station dependencies that are invisible to traditional models.
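The precise edge weighting is not specified in this draft; as a hedged sketch, a dense adjacency matrix without self-connections could be derived from pairwise great-circle distances between stations. The Gaussian kernel and `sigma` length scale below are illustrative assumptions, not the project's actual choices:

```python
import numpy as np

def dense_adjacency(lats, lons, sigma=50.0):
    """Build a dense, distance-weighted adjacency matrix with no self-loops.

    Uses a Gaussian kernel over great-circle distances (km); `sigma` is a
    hypothetical length scale, not a value taken from the project.
    """
    lat = np.radians(np.asarray(lats))[:, None]
    lon = np.radians(np.asarray(lons))[:, None]
    # Haversine great-circle distance between every pair of stations (km)
    dlat = lat - lat.T
    dlon = lon - lon.T
    a = np.sin(dlat / 2) ** 2 + np.cos(lat) * np.cos(lat.T) * np.sin(dlon / 2) ** 2
    dist_km = 2 * 6371.0 * np.arcsin(np.sqrt(a))
    adj = np.exp(-(dist_km / sigma) ** 2)  # closer stations -> larger weight
    np.fill_diagonal(adj, 0.0)             # exclude self-connections
    return adj

# Example using three of the Kansas stations' coordinates from the data below
adj = dense_adjacency([37.9275, 37.0442, 37.0008], [-100.7244, -100.9599, -101.8800])
```

Any monotone decreasing function of distance would serve the same purpose; the point is only that nearby stations receive heavier edge weights and the diagonal stays zero.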

Our evaluation focuses on forecasting performance over a 6-month test period at the end of the dataset. We assess how well each modeling approach predicts key weather variables and investigate the conditions under which one model may outperform the other.
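A chronological holdout of this kind can be sketched as a simple date-cutoff split (the 30-day month approximation and the helper name are illustrative, not the project's actual code):

```python
import datetime

def chronological_split(timestamps, test_months=6):
    """Split sample indices so the final `test_months` form the test set.

    Assumes `timestamps` is sorted ascending; approximates a month as 30 days.
    """
    cutoff = timestamps[-1] - datetime.timedelta(days=30 * test_months)
    train_idx = [i for i, t in enumerate(timestamps) if t < cutoff]
    test_idx = [i for i, t in enumerate(timestamps) if t >= cutoff]
    return train_idx, test_idx

# Daily timestamps for 2020: the last ~6 months become the held-out test set
days = [datetime.datetime(2020, 1, 1) + datetime.timedelta(days=i) for i in range(366)]
train_idx, test_idx = chronological_split(days)
```

Splitting by time rather than at random matters here: a random split would leak future observations into training and overstate both models' forecasting skill.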

Methods

This section will be expanded as the modeling process is further refined.

This section outlines the modeling approaches, data structure, and training procedures used to compare the traditional linear model and the GNN on weather station data.

1. Data Selection

Work in progress

2. Cleaning Process

Work in progress

3. Linear Model

The linear model is formulated as a time-series regression task. It uses the feature information from the previous four time steps to predict the feature values at the next time step. Each input consists of a concatenation of the five meteorological features across four sequential time steps, resulting in a fixed-length input vector per prediction target. The five input features are:

  • Temperature
  • Relative Humidity
  • Wind Speed
  • Wind Direction (sine component)
  • Wind Direction (cosine component)
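This windowing scheme can be sketched in pure NumPy, with ordinary least squares standing in for the actual fitting code (the helper name and the synthetic series are illustrative):

```python
import numpy as np

def make_windows(X, lags=4):
    """Turn a (T, n_features) series into lagged inputs and next-step targets.

    Each input row concatenates the previous `lags` time steps
    (lags * n_features values); the target is the following time step.
    """
    inputs = np.array([X[t - lags:t].ravel() for t in range(lags, len(X))])
    targets = X[lags:]
    return inputs, targets

rng = np.random.default_rng(3435)
series = rng.normal(size=(100, 5))             # 100 steps x 5 features
X, y = make_windows(series)                    # X: (96, 20), y: (96, 5)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # one linear map per target feature
```

With four lags of five features, every prediction is a linear function of a 20-dimensional input vector, and all five targets are fit jointly from the same windows.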

4. GNN

The GNN is designed to capture spatiotemporal dependencies in the weather station network. It is implemented using PyTorch and follows a structure inspired by the Diffusion Convolutional Recurrent Neural Network (DCRNN) architecture.

  • Architecture
    • Input Format: Data is structured using the StaticGraphTemporalSignal format, where each node represents a weather station and temporal sequences of node features are used for prediction.
    • Layers:
      • Stacked DCRNN layers to capture spatial and temporal dependencies
      • A ReLU activation function
      • A Linear output layer for the final prediction
  • Training Configuration
    • Optimizer: Adam
    • Learning Rate: Base learning rate of 0.01, reduced by a factor of 0.1 on plateau
    • Epochs: Trained for a maximum of 100 epochs with an early-stopping callback

The model is trained to predict the same five features (temperature, relative humidity, wind speed, wind direction sin, wind direction cosine) for the next time step based on the preceding four time steps, analogous to the linear model.
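The plateau schedule and early exit described above can be sketched in plain Python as a toy replay of a loss history. The patience values are illustrative assumptions; the real loop presumably relies on PyTorch's ReduceLROnPlateau and an early-stopping callback:

```python
def train_schedule(losses, base_lr=0.01, factor=0.1,
                   lr_patience=3, stop_patience=10, max_epochs=100):
    """Replay a loss history: reduce the LR by `factor` when the loss
    plateaus, and stop early once it has not improved for `stop_patience`
    epochs. Returns the LR used at each epoch."""
    lr, best = base_lr, float("inf")
    since_best = since_drop = 0
    history = []
    for loss in losses[:max_epochs]:
        if loss < best:
            best, since_best, since_drop = loss, 0, 0
        else:
            since_best += 1
            since_drop += 1
        if since_drop >= lr_patience:   # plateau: shrink the step size
            lr *= factor
            since_drop = 0
        history.append(lr)
        if since_best >= stop_patience:  # early exit
            break
    return history

# A loss that improves once and then flatlines triggers both mechanisms
lrs = train_schedule([1.0, 0.8, 0.8, 0.8, 0.8] + [0.8] * 20)
```

The run starts at the base rate of 0.01, decays it by 0.1 at each plateau, and terminates well before the 25 supplied epochs once no improvement is seen.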

Analysis and Results

Data Exploration and Visualization


Code
import datetime
import polars as pl

start_date = datetime.datetime(2010, 1, 1, 0, 0)
end_date = datetime.datetime(2020, 12, 31, 0, 0)

seed = 3435
split_index = 730

pl.enable_string_cache()

data_path = r'kansas_asos_2010_2020.csv'

shape: (30_667, 12) — first and last five rows shown (station: cat; valid: datetime[μs]; drct_sin, drct_cos: f64; all others: f32)

| station | valid | lat | lon | elevation | tmpf | dwpf | relh | sknt | feel | drct_sin | drct_cos |
|---------|-------|-----|-----|-----------|------|------|------|------|------|----------|----------|
| GCK | 2018-01-01 00:00:00 | 37.927502 | -100.724403 | 881.0 | 9.283334 | -6.5 | 48.148335 | 8.666667 | -4.161667 | 0.851117 | 0.524977 |
| LBL | 2018-01-01 00:00:00 | 37.044201 | -100.9599 | 879.0 | 12.316667 | -1.983333 | 52.014999 | 10.166667 | -1.721667 | 0.664796 | 0.747025 |
| EHA | 2018-01-01 00:00:00 | 37.000801 | -101.879997 | 1099.0 | 15.555555 | 5.255556 | 63.382778 | 7.388889 | 4.482222 | 0.970763 | 0.24004 |
| HQG | 2018-01-01 00:00:00 | 37.163101 | -101.370499 | 956.52002 | 14.311111 | -1.605556 | 48.468334 | 7.777778 | 2.681667 | 0.936332 | 0.351115 |
| 3K3 | 2018-01-01 00:00:00 | 37.991699 | -101.7463 | 1005.700012 | 13.1 | -0.9 | 53.127777 | 6.777778 | 2.151111 | 0.981255 | 0.192712 |
| … | … | … | … | … | … | … | … | … | … | … | … |
| EHA | 2020-12-31 00:00:00 | 37.000801 | -101.879997 | 1099.0 | 42.355556 | 16.588888 | 34.955555 | 3.0 | 40.254444 | -0.533205 | -0.845986 |
| HQG | 2020-12-31 00:00:00 | 37.163101 | -101.370499 | 956.52002 | 40.722221 | 13.255555 | 32.23111 | 1.666667 | 39.450001 | 0.977334 | -0.211704 |
| 3K3 | 2020-12-31 00:00:00 | 37.991699 | -101.7463 | 1005.700012 | 40.200001 | 12.2 | 31.377777 | 4.555555 | 36.683334 | -0.700217 | -0.71393 |
| JHN | 2020-12-31 00:00:00 | 37.578201 | -101.7304 | 1012.710022 | 40.711113 | 17.4 | 38.554443 | 4.777778 | 37.06889 | -0.824675 | -0.565607 |
| 19S | 2020-12-31 00:00:00 | 37.496899 | -100.832901 | 892.570007 | 39.922222 | 15.388889 | 36.552223 | 6.111111 | 35.113335 | -0.824675 | 0.565607 |


shape: (30_667, 6) — first and last five rows shown (station: cat; all others: f64)

| station | tmpf | relh | sknt | drct_sin | drct_cos |
|---------|------|------|------|----------|----------|
| GCK | -1.355721 | 0.481483 | -0.080357 | 0.851117 | 0.524977 |
| LBL | -1.265174 | 0.52015 | 0.160714 | 0.664796 | 0.747025 |
| EHA | -1.168491 | 0.633828 | -0.285714 | 0.970763 | 0.24004 |
| HQG | -1.205638 | 0.484683 | -0.223214 | 0.936332 | 0.351115 |
| 3K3 | -1.241791 | 0.531278 | -0.383929 | 0.981255 | 0.192712 |
| … | … | … | … | … | … |
| EHA | -0.368491 | 0.349556 | -0.991072 | -0.533205 | -0.845986 |
| HQG | -0.417247 | 0.322311 | -1.205357 | 0.977334 | -0.211704 |
| 3K3 | -0.432836 | 0.313778 | -0.741072 | -0.700217 | -0.71393 |
| JHN | -0.417579 | 0.385544 | -0.705357 | -0.824675 | -0.565607 |
| 19S | -0.441128 | 0.365522 | -0.491072 | -0.824675 | 0.565607 |
RecurrentGCN(
  (recurrent1): DCRNN(
    (conv_x_z): DConv(204, 64)
    (conv_x_r): DConv(204, 64)
    (conv_x_h): DConv(204, 64)
  )
  (recurrent2): DCRNN(
    (conv_x_z): DConv(96, 32)
    (conv_x_r): DConv(96, 32)
    (conv_x_h): DConv(96, 32)
  )
  (recurrent3): DCRNN(
    (conv_x_z): DConv(64, 32)
    (conv_x_r): DConv(64, 32)
    (conv_x_h): DConv(64, 32)
  )
  (linear): Linear(in_features=32, out_features=1, bias=True)
)
MSE: 0.0562

LinearRegression()
MSE: 0.0147
[Figure: LR Actual vs Predicted Over Time for Each Node — one panel per station (GCK, JHN, LBL, HQG, 19S, EHA, 3K3)]

[Figure: LR Absolute Error for Each Station — one panel per station (GCK, JHN, LBL, HQG, 19S, EHA, 3K3)]

[Figure: GNN vs. LR Absolute Error for Each Station — one panel per station (GCK, JHN, LBL, HQG, 19S, EHA, 3K3), y-axis limited to 0–500]

Modeling and Results


Conclusion

